Selecting fast folding proteins by their rate of convergence. 
Abstract

We propose a general method for predicting potentially good folders from a given number of 
amino acid sequences. Our approach is based on the calculation of the rate of convergence of each 
amino acid chain towards the native structure using only the very initial parts of the dynamical 
trajectories. It does not require any preliminary knowledge of the native state and can be applied 
to different kinds of models, including atomistic descriptions. We tested the method within both 
the lattice and off-lattice model frameworks and obtained several previously unknown good folders. The
unbiased algorithm also allows one to determine the optimal folding temperature and requires at least
3-4 orders of magnitude fewer time steps than are needed to compute folding times.

It is well known that most proteins fold rapidly and reliably to a unique native state from any
of a vast number of unfolded conformations [1, 2]. One of the main problems in protein folding is
described by the so-called Levinthal paradox, which states that if the folding pathway of a protein
in phase space were governed by a random search, the time needed to locate the native state among
all configurations would exceed the age of the universe. Nowadays, the consensus answer to this
paradox is found in the designed energy landscape of a foldable protein, which resembles a
many-dimensional funnel, where moving along the free-energy gradient narrows the accessible
configuration space and guides the chain to the unique native structure, which lies at the bottom
of the funnel [3-5]. The funnel is also rough, giving rise to local minima, which can act as traps
during folding. In contrast to a designed protein, a random amino acid chain will not fold to its
global free-energy minimum in times shorter than that needed to explore the configuration space
completely, and these times are astronomically large [6].



In this paper we call good folders those amino acid sequences that exhibit protein-like
behavior, i.e. those that fold into the unique native state within a reasonable time. Finding a
way of characterizing good folders, such as typical motifs in the amino acid sequence or specific
properties of the energy landscape, is of vital importance. A widely used criterion to
characterize a good folder is a pronounced energy gap between its global energy minimum
and the energies of configurations that are structurally dissimilar to the configuration of
the global minimum. The energy gap ensures the "thermodynamic stability", and one finds a
correlation between the energy gap and the ability to fold into the global minimum within a
reasonable time. Yet, without knowing the native state, there is still no good way to check
whether a given amino acid sequence is a good folder other than letting it dynamically evolve
from various initial conformations and checking whether it actually folds into a unique native
state. Because the folding time is unknown, it may take a very long time before one can identify
a given amino acid chain as a bad folder.

Many studies have been devoted to the search for determinants of a protein-like system.
Apart from the energy gap, one could mention the relation between the folding and glass
transition temperatures, see e.g. [8], the collapse cooperativity [9], etc. In this respect it is
important to understand how these features, which are characteristic of foldable proteins,
could help distinguish a good folder from a bad folder. A clear determinant of a good folder
cannot always serve as a criterion for the selection of protein-like amino acid sequences. It
turns out that in order to do a fast selection, in most cases one needs to know the global minimum
(native state) from the beginning. The energy gap clearly assumes knowledge of the
energy in the native state. In [10] the authors use the microcanonical ensemble to distinguish
good folders from bad folders, but the efficient procedure also requires knowledge of the
global energy minimum. In simple Go-like models [11], where similar problems have been
posed (cf. [12-14]), the model space as a whole is biased by the predetermined native state.

In [15] the authors propose an interesting idea: to study the fluctuations of the energy
landscape curvature (this requires a smooth energy surface). This idea was tested on the
off-lattice model with three types of amino acids; the description of the model and some of the
good folders can be found in [16]. It turns out that the averaged curvature of the potential
energy, $K_R := \nabla^2 U$, of a foldable protein shows a dramatic enhancement of its fluctuations
in the vicinity of the folding temperature $T = T_f$. This direction of research was further
pursued in [17]. In this approach, preliminary knowledge of the native state is not necessary.
Successful selection of good folders in [15, 17] was done from only 6 sequences, which is too few
to make a comparative analysis and to judge the effectiveness of the method. It is also
important to note that the curvature is averaged almost along the whole folding pathway, i.e.
over the whole folding time (the folding time can be found in [16]). Sometimes the energy
landscape is funneled towards several deep minima, and since the approach in [15, 17] is
purely local, it is unclear how one can distinguish good folders from bad folders in this case.
Presumably, this method works well when one compares a funneled and a totally frustrated
energy landscape, which was indeed the case in [15, 17].

In this paper, using lattice and off-lattice models, we investigate to what extent the
convergence of dynamical trajectories in configuration space at early stages could serve as a
distinguishing criterion for a good folder. We emphasize that knowledge of the native
state is not required! One can illustrate the idea using an analogy to convergence
criteria for a sequence of real numbers. On the one hand, by definition, a sequence
$A_n \in \mathbb{R}$ for $n = 1, 2, \ldots$ converges if there exists $A_0 \in \mathbb{R}$ such that
for all $\varepsilon > 0$ one can find $N$ which guarantees that $|A_n - A_0| < \varepsilon$ holds
for $n > N$. Equivalently, on the other hand, the sequence $A_n$ converges if for all
$\varepsilon > 0$ one can find $N$ so that $|A_n - A_m| < \varepsilon$ holds for $n, m > N$ (the
Cauchy criterion). In the first case one needs to know the exact limit of the sequence (read:
the native state). In the second case one does not have to know the limit of the sequence, and
similarly, it is not necessary to know the native state in our approach.



There are various ways to describe the dynamics of an amino acid chain in the solvent
(Langevin dynamics for atomistic models [18], Monte Carlo (MC) dynamics for lattice models
[7, 19], etc.). Generally, the time development of the configuration can be written as
$\mathcal{C}(t) = g^t \mathcal{C}(0)$, where $\mathcal{C}(0)$ is the initial configuration and $g^t$
denotes the dynamical transformation, which depends on temperature and has a probabilistic
nature if it simulates how water molecules affect the amino acid chain.

The effect of the folding funnel could also be expressed in terms of the dynamical
transformation: if the dynamical transformation acts on two arbitrary points in the
configuration space, then the "distance" between them contracts,
$d(\mathcal{C}_1(t), \mathcal{C}_2(t)) < d(\mathcal{C}_1(0), \mathcal{C}_2(0))$, where $d$ stands for
the "distance" between configurations. The time $t$ should exceed the minimal time required for
overcoming typical local traps in the folding funnel. This expresses the idea that if one takes
a good folder in two randomly chosen initial configurations and lets it dynamically propagate
over a suitable time, then structural similarities should emerge between the two propagated,
yet initially unrelated, configurations.

Now imagine the following problem being posed: out of K amino acid sequences one has 
to sort out the best candidates for folding in some reasonable time. The brute force solution 
to this problem would be to let each amino acid sequence evolve according to the dynamics 
starting from various random initial configurations and to check whether the dynamical 
trajectories reach the same native conformation. This may be, however, extremely time 
consuming (especially in the case of molecular dynamics simulations with water molecules 
included). In addition, it is a priori unclear how long the dynamical simulation must be 
run because the folding time is initially unknown. Moreover, the native contacts need not
be known for an arbitrary sequence, which prevents the application of Go-type
models. In this paper we propose an alternative solution to this problem based on comparing
amino acid sequences through their rate of convergence. To define the rate of convergence 
for a given amino acid sequence S we proceed as follows. 

Suppose the pairwise interaction between two monomers is $V_{ij}(r_{ij})$, where $r_{ij}$ is the
relative coordinate between the two monomers. Let us extract the negative part of the potential
function, setting $W_{ij}(r) := \max[0, -V_{ij}(r)]$, and define the magnitude of a contact between
amino acids $i$ and $j$ as

$\mathcal{V}_{ij}(r) := W_{ij}(r) / \max_{r'} W_{ij}(r')$   (for $j \neq i-1, i+1$),   (1)

and $\mathcal{V}_{ij}(r) := 0$ for $j = i-1, i+1$ (in the expression for $\mathcal{V}_{ij}(r)$ we exclude
the bulk contributions from neighboring monomers). Clearly, $0 \le \mathcal{V}_{ij}(r) \le 1$. Let
$r_{ij}^{(1)}$ and $r_{ij}^{(2)}$ denote the relative coordinates of monomers $i$ and $j$ in the
configurations $\mathcal{C}_1$ and $\mathcal{C}_2$ respectively. Then the overlap between two
configurations $\mathcal{C}_1$ and $\mathcal{C}_2$ is defined as

$O(\mathcal{C}_1, \mathcal{C}_2) = \sum_{i<j}^{N} \mathcal{V}_{ij}(r_{ij}^{(1)}) \, \mathcal{V}_{ij}(r_{ij}^{(2)})$,   (2)

where $N$ is the number of amino acids in the protein. The overlap introduces
a topology on the space of configurations. Note that the more compact and structurally
similar two configurations are, the larger the overlap between them. Eqs. (1) and (2) are
quite general and can be applied to any force field. As a particular case, for lattice models
$\mathcal{V}_{ij}(r_{ij}) = 1$ if the monomers are "in contact" in the given configuration and zero
otherwise. For various definitions of contact see, for example, [20, 21].
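
For the lattice case the overlap reduces to counting shared contacts. The following is a minimal sketch, not the authors' code: it assumes conformations are given as lists of integer cubic-lattice coordinates and uses the nearest-neighbour contact definition (adjacent lattice sites, excluding sequence neighbours). The function names `contacts` and `overlap` are chosen here for illustration only.

```python
from itertools import combinations


def contacts(conf):
    """Return the set of index pairs (i, j), i < j, that are lattice contacts.

    conf is a list of integer (x, y, z) positions, one per monomer.
    """
    pairs = set()
    for i, j in combinations(range(len(conf)), 2):
        if j - i < 2:
            continue  # sequence neighbours are excluded from the contacts
        manhattan = sum(abs(a - b) for a, b in zip(conf[i], conf[j]))
        if manhattan == 1:  # nearest neighbours on the cubic lattice
            pairs.add((i, j))
    return pairs


def overlap(conf1, conf2):
    """Lattice overlap: the number of contacts shared by both conformations."""
    return len(contacts(conf1) & contacts(conf2))
```

With contact magnitudes restricted to 0 or 1, the product in the overlap sum is 1 exactly when a pair is in contact in both conformations, which is why a set intersection suffices here.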

Next, let us fix some time scale $t_0$, which should be larger than the typical time required
for the dynamically evolving configurations to overcome local minima on the energy surface.
We then let a given amino acid chain dynamically propagate over the time $t_0$ starting from
two randomly chosen initial configurations (self-avoiding random walks on the lattice)
$\mathcal{C}_1$ and $\mathcal{C}_2$. The overlap between the resulting configurations
$g^{t_0}\mathcal{C}_1$ and $g^{t_0}\mathcal{C}_2$ is then $O(g^{t_0}\mathcal{C}_1, g^{t_0}\mathcal{C}_2)$.
Sampling over randomly chosen initial configurations $\mathcal{C}_1$ and $\mathcal{C}_2$, we
calculate the arithmetic mean of the overlaps, which we denote as $R(t_0, T)$ and call the rate of
convergence of the given amino acid sequence. Here $T$ denotes the temperature (the dependence on
$T$ is hidden in the dynamical transformation). Below we show that the rate of convergence
$R(t_0, T)$ can be used to select and design good folders. (In order to give a proper dimension to
the rate one could divide $R(t_0, T)$ by $t_0$; we do not do this because this rescaling does not
affect the results.) Let us remark that since proteins coil into the native state from any initial
configuration, we impose no restrictions on the domain of initial configurations.
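
The sampling procedure just described can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used in this work: the callables `sample_initial`, `propagate` (standing for the dynamical transformation over the time $t_0$ at a fixed temperature) and `overlap` are hypothetical placeholders that a user would supply for a concrete model.

```python
import random
from statistics import mean


def rate_of_convergence(sample_initial, propagate, overlap, n_pairs, rng):
    """Estimate R(t0, T) as the mean overlap of independently propagated pairs.

    sample_initial(rng) -> a random initial configuration
    propagate(conf, rng) -> the configuration after time t0 (assumed stochastic)
    overlap(c1, c2)      -> the overlap between two configurations
    """
    values = []
    for _ in range(n_pairs):
        c1 = propagate(sample_initial(rng), rng)
        c2 = propagate(sample_initial(rng), rng)
        values.append(overlap(c1, c2))
    return mean(values)
```

Passing the random number generator explicitly keeps the estimate reproducible for a given seed, which is convenient when comparing many sequences.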

Now we take the next step and construct the normalized rate of convergence. For this
purpose we first generate a large number of random amino acid sequences and calculate
$R(t_0, T)$ for each sequence, where $t_0$ and $T$ are the fixed time of evolution and temperature
respectively. We denote the arithmetic mean of these values as $R_{\mathrm{random}}(t_0, T)$. This
quantity is the expectation value of the rate of convergence of a random sequence, depending on
temperature and on the time scale $t_0$. The normalized rate of convergence $R_N(t_0, T)$ of an
amino acid sequence $S$ is then defined as

$R_N(t_0, T) = R(t_0, T) / R_{\mathrm{random}}(t_0, T)$.   (3)

Let us remark that the values of $R_{\mathrm{random}}(t_0, T)$ can be tabulated so that $R_N(t_0, T)$
can be determined with the same computational effort as $R(t_0, T)$.

If an amino acid sequence has $R_N(t_0, T) > 1$ then its rate of convergence is larger than
that of a random sequence; the converse is also true. The normalized rate of convergence can
be assigned to any amino acid sequence, and the larger $R_N(t_0, T)$, the better the chances
for this sequence to be a good folder. Therefore, the best candidates for being a good folder
among a number of given amino acid sequences can be found by sorting the sequences by
their normalized rate of convergence. The degree to which this sorting algorithm is effective
depends on how the value of $t_0$ that is sufficient for proper sorting relates to the mean
folding time. In the following we demonstrate that the selection and design of good folders using
the rate of convergence works for both a standard lattice model and an off-lattice model of
proteins [22].
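
The selection step amounts to a normalization followed by a sort. A minimal sketch is given below; it assumes the rates $R(t_0, T)$ have already been computed and stored in a dictionary keyed by sequence, and that a tabulated value of the random-sequence average is available. The function names and the numbers in the usage note are illustrative only.

```python
def normalized_rates(rates, r_random):
    """Divide each rate R(t0, T) by the random-sequence average, as in Eq. (3)."""
    return {seq: r / r_random for seq, r in rates.items()}


def rank_sequences(rates, r_random):
    """Return (sequence, R_N) pairs sorted by R_N in descending order."""
    rn = normalized_rates(rates, r_random)
    return sorted(rn.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_sequences({"A": 2.0, "B": 1.0, "C": 4.0}, 2.0)` places the hypothetical sequence "C" first, since its normalized rate is the largest.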



Although geometrically poor, the lattice model is protein-like in the sense that lattice
proteins fold to a unique native structure from an astronomically large number of possible
initial conformations and do so rapidly and reproducibly. A random configuration is then
a self-avoiding random walk on the cubic lattice. The sequences are composed of the 20 amino
acid types. Two monomers are "in contact" if they occupy neighboring positions on the lattice
but are not sequence neighbors. The energy of two monomers in contact is calculated using
the 20 x 20 Miyazawa-Jernigan matrix (Table VI in [23]). The dynamic transformation $g^t$
is implemented through Monte Carlo dynamics [22] with a move set including end moves,
corner flips, and crankshaft moves.
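
Random initial configurations of this kind can be generated, for instance, by simple rejection sampling. The sketch below is one possible approach and is not taken from the software used in this work: it grows a non-reversing random walk on the cubic lattice and discards the whole walk on the first collision, so that every self-avoiding walk of the given length is produced with equal probability per attempt. The name `self_avoiding_walk` and the parameter `max_tries` are assumptions made for illustration.

```python
import random

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def self_avoiding_walk(n_monomers, rng, max_tries=100000):
    """Return a list of n_monomers distinct lattice sites forming a chain."""
    for _ in range(max_tries):
        walk = [(0, 0, 0)]
        occupied = {(0, 0, 0)}
        last_step = None
        while len(walk) < n_monomers:
            # Exclude only the immediate reversal of the previous step.
            options = [
                s for s in STEPS
                if last_step is None or s != tuple(-c for c in last_step)
            ]
            step = rng.choice(options)
            site = tuple(p + d for p, d in zip(walk[-1], step))
            if site in occupied:
                break  # collision: discard the whole walk and start again
            walk.append(site)
            occupied.add(site)
            last_step = step
        if len(walk) == n_monomers:
            return walk
    raise RuntimeError("no self-avoiding walk found within max_tries")
```

Rejection sampling becomes inefficient for long chains; for chains of a few tens of monomers, as considered here, it is expected to remain practical.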



We have chosen a designed sequence [24] of 36 monomers, $S_0$ =
SQKWLERGATRIADGDLPVNGTYFSCKIMENVHPLA. The native state of $S_0$ has the energy
$E_{\mathrm{Nat}}(S_0) = -16.5$ in dimensionless $k_B T_{\mathrm{room}}$ units, where
$T_{\mathrm{room}}$ stands for the room temperature [23]. At the folding temperature $T_f = 0.25$
(in Miyazawa-Jernigan dimensionless units) the sequence $S_0$ always reaches its native state
starting from any conformation, and the mean folding time (obtained by sampling $10^3$
self-avoiding random walks as initial configurations) is $t_f = 1.5 \times 10^6$ steps.

In our calculations we generated 800 sequences with a random amino acid composition,
and the designed sequence $S_0$ was hidden among the random sequences as "a needle in a
haystack". For each amino acid sequence we calculated the normalized rate of convergence
and then sorted all sequences by the corresponding value in descending order. We computed
$R_N(t_0, T_f)$, where $T_f = 0.25$ is the folding temperature of $S_0$, over 500 randomly chosen
pairs of initial positions (conformations), starting with $t_0 = 50$, and repeated the procedure,
incrementing $t_0$ by 50 each time. The initial conformations are generated as self-avoiding
random walks on the lattice. We stress that for each new time period the 800 random sequences
were generated anew. We have observed that a further increase of the number of random sequences
changes the value of $R_{\mathrm{random}}(t_0, T)$ by ±1% in the considered range of $t_0$, $T$.
Recall that these values can be obtained once with high accuracy and then tabulated for various
values of $t_0$, $T$, $N$, where $N$ is the number of monomers.

In general, for $t_0 \le 150$ the designed sequence gets lost among the other random sequences,
indicating that a time $t_0 < 150$ is insufficient for overcoming local minima through potential
barriers. For $t_0 > 200$ the sequence $S_0$ gets into the top ten, which leads us to conclude
that $t_0 > 200$ is sufficient for distinguishing the sequences by their ability to fold. The
dependence of the normalized rate of convergence on the temperature $T$ for fixed $t_0$ is also a
relevant quantity. Remarkably, $R_N(t_0, T)$ of $S_0$ peaks exactly at the folding temperature
$T_f$, see Fig. 1.

In order to show that the rate of convergence can also be used to perform sequence
design, we applied the algorithm to 5000 randomly generated amino acid sequences having
36 monomers. The top 5 sequences turned out to be good folders. We used $t_0 = 200$
and the sampling was done over 300 pairs of initial positions. The temperature was set
to the folding temperature of the designed sequence $S_0$, namely $T = T_f$. Interestingly,
the sequence $S_0$ occupied only position 3. The two top folders found correspond to
the sequences $S_1$ = KWEEHEWGKDNLSDLHMHENEERFAQEQHNRDPQTD and $S_2$ =
NALCDDCSTEWCIPSMCCMCFEFIDFYKKKQQWRQM. The native states of $S_1$ and $S_2$
are shown in Fig. 4. The energies of the native states are $E_{\mathrm{Nat}}(S_1) = -16.88$ and
$E_{\mathrm{Nat}}(S_2) = -14.29$ respectively. Note that $E_{\mathrm{Nat}}(S_1)$ is even lower than
that of the previously known sequence $S_0$, despite the fact that $S_1$ has 6 fewer native
contacts than $S_0$ (note that the structure of $S_0$ was specifically designed to maximize the
number of native contacts, and 40 native contacts is the maximal reachable number for a sequence
length of 36 monomers). Fig. 1 shows the normalized rate of convergence for the sequences $S_0$ and
$S_1$ as a function of temperature. In the given temperature range the normalized rate of
convergence for $S_1$ is larger than that of $S_0$. The same occurs for $S_2$ (not shown in Fig. 1).

Both newly found sequences $S_{1,2}$ have a folding temperature equal to $T_f$, and their
folding time is approximately 50 times longer than the folding time of $S_0$. This fact
deserves discussion: in spite of $S_{1,2}$ having a better normalized rate of convergence than
$S_0$ at all temperatures, their folding time is substantially longer. In [24] one
finds a procedure for sequence design, where one fixes the target conformation and
finds the amino acid sequence which minimizes the energy in this conformation. The target
structure then becomes the native state of the obtained good folder. The same design also works
in the case of off-lattice models [26]. The sequence design in our approach does not fix
the native conformation but rather fixes the target temperature. The obtained good folders
have a folding temperature equal to the target temperature!

In addition, we applied our method to other sequences already designed by other authors.
For instance, for the sequences in Figs. 1, 2 of Ref. [25] the method yields excellent results.
In Fig. 1 we also plot the rate of convergence versus temperature for the sequence $S_3$ =
GYLGEIWKIMWAEMMKSWMSGWKGGEMGEWLKGIKG (Fig. 2 in [25]). The
curve peaks exactly at the folding temperature.

As mentioned before, the rate of convergence $R(t_0, T)$ of a given sequence is
calculated by sampling over randomly chosen pairs of initial conformations. If one considers
100 pairs of random initial conformations, then the distribution of the overlaps for $t_0 = 300$
and $T = T_f$ is almost Gaussian (as it should be in the ideal case according to the central
limit theorem).
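
Since the rate of convergence is an arithmetic mean over sampled pairs, its statistical uncertainty can be estimated in the usual way. The snippet below is a generic sketch using the standard error of the mean; it is not part of the procedure described in this work, and the function name is chosen here for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_standard_error(overlaps):
    """Return the sample mean of the overlaps and its standard error."""
    n = len(overlaps)
    return mean(overlaps), stdev(overlaps) / sqrt(n)
```

The standard error decreases as the inverse square root of the number of sampled pairs, which gives a rough guide to how many pairs are needed for a given accuracy.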

We now demonstrate that the method proposed here is also able to characterize and design
good folders in the more sophisticated off-lattice model of proteins proposed by Clementi et
al. in [26]. In this force field the interaction between amino acids $i$ and $j$ is given by [27],

$V_{ij} = \delta_{i,j+1} \, a (r_{ij} - r_0)^2 + (1 - \delta_{i,j+1}) \, 4\epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]$,   (4)

where $a = 50$ Å$^{-2}$ and $r_0 = 3.8$ Å. The sets of parameters $\epsilon_{ij}$ and $\sigma_{ij}$
denote the minimum energy and the equilibrium distance for the Lennard-Jones (LJ) part of the
potential. We considered $N_{\mathrm{conf}}$ sequences (with $N_{\mathrm{conf}} = 100$) of $N = 30$
monomers. To compute the time evolution $g^t$ of the monomers we used Monte Carlo dynamics. The
overlap between configurations was computed using Eq. (2), and the rate of convergence was
obtained by averaging over $N_{\mathrm{conf}} (N_{\mathrm{conf}} - 1)/2 = 4950$ pairs of randomly
chosen conformations, which were determined as follows. First, we chose random positions for the
monomers in the range [0:16] in units of distance without any bias. Then, the structures generated
in this way were equilibrated during 2000 Monte Carlo steps, thus generating the starting
structural configurations.
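
The number of pairs quoted above follows from counting all unordered pairs of 100 conformations. A short check, written as a sketch with illustrative variable names:

```python
from itertools import combinations
from math import comb

n_conf = 100  # number of starting conformations, as stated in the text

# All unordered pairs (k, l) with k < l of conformation indices.
pairs = list(combinations(range(n_conf), 2))
n_pairs = len(pairs)
```

Here `n_pairs` equals `comb(n_conf, 2)`, i.e. 100 x 99 / 2 = 4950.
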

We analyzed 6 sequences (see Table I) belonging to 3 different polymer types according
to the classification in [27]. We considered 3 sequences of designed heteropolymer character
(DHTP), labeled SEQ1, SEQ2 and SEQ3, 2 random heteropolymers (RHTP) (SEQ4
and SEQ5) and the homopolymer (SEQ6). In general, heteropolymers designed following
the procedure introduced in [27] have good chances of being protein-like, whereas for random
heteropolymers and for homopolymers one expects a rugged energy landscape and consequently
bad folding behavior.

Note that SEQ1 has been shown to be a good folder, whereas SEQ4 and SEQ6 have
been previously characterized as bad folders [27]. The sequences SEQ2, SEQ3 and SEQ5,
generated by us in this work, have not been considered so far in the literature.

The rate of convergence clearly allows one to separate good folders from bad ones
at almost any step of the dynamical simulation. Fig. 2 shows the rate of convergence as a
function of time for the 6 studied sequences at fixed temperature. From the inset of Fig. 2
one can see that good folders can be identified after fewer than $10^4$ time steps, i.e.,
at an early stage of the dynamical transformation $g^t$. At the folding temperature our method
allows for a selection of good folders by computing trajectories at least 3 to 4 orders of
magnitude shorter than those needed to compute the folding time.

In Fig. 3 we show the temperature dependence of the normalized rate of convergence
$R_N(t_0, T)$ for the 6 sequences studied. The values of $R_{\mathrm{random}}(t_0, T)$ were computed
using 100 random sequences; a further increase of the number of random sequences changes the
value of $R_{\mathrm{random}}(t_0, T)$ by ±2.5% in the considered range of $t_0$, $T$. Let us stress
that one can get better accuracy for $R_{\mathrm{random}}(t_0, T)$ using a larger number of random
sequences; this does not affect the effectiveness of the method, since for all models the values
$R_{\mathrm{random}}(t_0, T)$ can be tabulated after being calculated once.

The normalized rate of convergence was computed over 100 sequences SEQ1,
SEQ2, ..., SEQ100, of which SEQ1, SEQ2 and SEQ3 belonged to the DHTP model, SEQ6
was an HMP and the rest of the sequences were random heteropolymers (RHTPs). The
different functional dependence of good and bad folders is very clear. For good folders
$R_N(t_0, T)$ is larger than 1 at all temperatures and exhibits a well-defined maximum, whereas
for bad folders $R_N(t_0, T) \approx 1$ and practically does not depend on temperature.

In order to investigate whether the temperature dependence of $R_N(t_0, T)$ is also
physically relevant, as in the case of the lattice model, we performed Wang-Landau Monte Carlo
simulations to calculate the specific heat curves of the three good folders. Results are
displayed in the lower panel of Fig. 3. The specific heats of SEQ1, SEQ2 and SEQ3 show the
typical peaked shape at the folding temperatures $T_f(\mathrm{SEQ}i)$, $i = 1, 2, 3$,
characteristic of protein-like sequences. By comparing the upper and lower panels of Fig. 3 one
concludes that the position of the maxima of $R_N(t_0 = 10^7, T)$ gives a reasonably good
approximation to the folding temperatures. In order to obtain smooth curves of $R_N$ vs $T$
such as those shown in Fig. 3, one has to take large values of $t_0$. From Fig. 3 it is clear that
for each sequence $R_N(t_0, T)$ exhibits a broad maximum around $T_f$. Again, let us stress that
the rate of convergence is not only efficient in distinguishing good and bad folders but also
predicts the suitable temperature range for a good folder.
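
Reading off the approximate folding temperature from a computed curve is a simple maximum search over the temperature grid. A minimal sketch follows; the function name and the numerical values in the usage note are hypothetical and are not data from this work.

```python
def peak_temperature(temperatures, rn_values):
    """Return the grid temperature at which the normalized rate is largest."""
    if len(temperatures) != len(rn_values) or not temperatures:
        raise ValueError("need two non-empty sequences of equal length")
    best = max(range(len(temperatures)), key=lambda k: rn_values[k])
    return temperatures[best]
```

For a made-up curve such as `peak_temperature([0.15, 0.20, 0.25, 0.30], [1.1, 1.4, 1.9, 1.3])` the function returns 0.25. Because the maximum is broad, the resolution of the estimate is limited by the spacing of the temperature grid and by the statistical noise in each point.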

Finally, we demonstrate that the new sequences SEQ2 and SEQ3, designed using the
method of the rate of convergence, are indeed foldable. We computed the average root
mean square deviation

$\delta(t) = \frac{1}{N_{\mathrm{conf}}} \sum_{k=1}^{N_{\mathrm{conf}}} \sqrt{ \frac{2}{N(N-1)} \sum_{i} \sum_{j>i} \left| r_{ij}^{(k)}(t) - r_{ij}^{\mathrm{nat}} \right|^2 }$,   (5)

where $r_{ij}^{\mathrm{nat}}$ refers to the intermonomer distances in the native state and
$N_{\mathrm{conf}} = 100$ to the number of initial conformations we average over. In Fig. 4 we show
the behavior of $\delta$, averaged over 100 independent trajectories, as a function of
$\log_{10}(t)$ for sequences SEQ1, SEQ2 and SEQ3. We can define the folding time as the time when
$\delta$ reaches a certain threshold value $\delta_{\mathrm{thr}}$. We set
$\delta_{\mathrm{thr}} \approx 3.9$ Å, which allows us to estimate the folding times as
$t_f(\mathrm{SEQ1}) = 4.5 \times 10^6$ time steps, $t_f(\mathrm{SEQ2}) = 6.6 \times 10^5$ time steps,
and $t_f(\mathrm{SEQ3}) = 2.4 \times 10^7$ time steps.
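
A distance-based deviation of this kind and the threshold definition of the folding time can be sketched as follows. This is an illustration under assumptions, not the authors' code: the normalization by the number of monomer pairs is assumed here, conformations are taken to be lists of 3D coordinates, and the names `distance_rmsd` and `folding_time` are hypothetical.

```python
import math
from itertools import combinations


def distance_rmsd(conf, native):
    """Root mean square deviation of intermonomer distances from the native ones."""
    n = len(conf)
    total = 0.0
    for i, j in combinations(range(n), 2):
        d = math.dist(conf[i], conf[j])
        d_nat = math.dist(native[i], native[j])
        total += (d - d_nat) ** 2
    # Divide by the number of pairs, n(n-1)/2 (assumed normalization).
    return math.sqrt(2.0 * total / (n * (n - 1)))


def folding_time(times, deltas, threshold):
    """Return the first time at which the deviation is at or below the threshold."""
    for t, d in zip(times, deltas):
        if d <= threshold:
            return t
    return None  # the threshold was not reached within the trajectory
```

Using intermonomer distances rather than coordinates makes the measure invariant under rotations and translations, so no structural superposition is needed.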

The three-dimensional structures of some of the sequences designed in this work using the
rate of convergence are shown in Fig. 5. Note that the main conclusion of this paper, namely
that the computational time required by the method of the rate of convergence is many
orders of magnitude less than the folding time, remains valid even taking into account that
the definition of $R$ involves sampling over many different initial conditions. Such sampling
operations can be run fully in parallel on as many different nodes as initial conditions
one needs. Let us mention, however, that although the procedure presented here is a good
method to identify potentially good folders, it cannot serve as an ultimate measure of a
good folder.

TABLE I. The six sequences studied in this paper and their corresponding models. All the
sequences have N = 30 monomers. The numbers in the second column denote the sequence of amino
acids in the peptide chain (using the same notation as in Ref. [26]).

Name   Sequence                         Model
SEQ1   311114442344312212224434333334   DHTP
SEQ2   341233331323231121112421234111   DHTP
SEQ3   443234423233421321132243424311   DHTP
SEQ4   414124323443321423324242141441   RHTP
SEQ5   444444444444444444444444444444   RHTP
SEQ6   321224314333113213344411112243   HMP

The method of the rate of convergence developed in this paper is applicable in all model
frameworks which allow for dynamics, including accurate atomistic descriptions. Note that
the rate of convergence $R$ can also be computed based on arbitrary definitions of overlap,
different from Eqs. (1) and (2). Moreover, it need not be restricted to the coordinate
(structural) space. One could, for instance, consider the overlap between strings containing
property factors [28] or their Fourier components [29].

The authors express their gratitude to Dr. Guido Tiana for providing his lattice-model 
dynamics software. 



* Electronic address: gridnev@has.uni-frankfurt.de
† Electronic address: garcia@physik.uni-kassel.de
* On leave from St Petersburg State University, Uljanovskaja 1, 198504 St Petersburg, Russia

[1] T. Creighton, Proteins: Structure and Molecular Properties (Freeman, New York, 1992).
[2] A. V. Finkelstein and O. B. Ptitsyn, Protein Physics: A Course of Lectures (Academic Press,
New York, 2002).
[3] R. Goldstein, Z. A. Luthey-Schulten, and P. Wolynes, Proc. Natl. Acad. Sci. U.S.A. 89, 4918
(1992).
[4] A. Sali, E. I. Shakhnovich, and M. Karplus, J. Mol. Biol. 235, 1614-1636 (1994).
[5] J. Bryngelson, J. N. Onuchic, N. D. Socci, and P. Wolynes, Proteins: Struct. Funct. Genetics
21, 167 (1995).
[6] E. Shakhnovich and A. Gutin, Proc. Natl. Acad. Sci. U.S.A. 90, 7195 (1993).
[7] E. I. Shakhnovich, Phys. Rev. Lett. 72, 3907 (1994).
[8] M. Cieplak, T. X. Hoang, and M. S. Li, Phys. Rev. Lett. 83, 1684 (1999).
[9] D. K. Klimov and D. Thirumalai, Phys. Rev. Lett. 76, 4070 (1996).
[10] J. Hernandez-Rojas and J. M. Gomez Llorente, Phys. Rev. Lett. 100, 258104 (2008).
[11] V. Tozzini, Curr. Opin. Struct. Biol. 15, 144 (2005).
[12] B. C. Gin, J. P. Garrahan, and P. L. Geissler, J. Mol. Biol. 392, 1303 (2009).
[13] J. Kim, T. Keyes, and J. E. Straub, Phys. Rev. E 79, 030902R (2009).
[14] L. Angelani and G. Ruocco, EPL 87, 18002 (2009).
[15] L. N. Mazzoni and L. Casetti, Phys. Rev. Lett. 97, 218104 (2006).
[16] T. Veitshans, D. Klimov, and D. Thirumalai, Folding Des. 2, 1 (1997).
[17] L. N. Mazzoni and L. Casetti, Phys. Rev. E 77, 051917 (2008).
[18] M. K. Gilson, Proteins: Struct., Funct., Genet. 15, 266 (1993).
[19] H. J. Hilhorst and J. M. Deutch, J. Chem. Phys. 63, 5153 (1975).
[20] M. Vendruscolo, R. Najmanovich, and E. Domany, Phys. Rev. Lett. 82, 656 (1999).
[21] F. Birzele, J. E. Gewehr, G. Csaba, and R. Zimmer, Bioinformatics 23, e205-e211 (2007); I.
Koch, Ein graphentheoretischer Ansatz zum paarweisen und multiplen Vergleich von
Proteinstrukturen (Wissenschaft und Technik Verlag, 1998).
[22] R. A. Broglia, G. Tiana, H. E. Roman, E. Vigezzi, and E. Shakhnovich, Phys. Rev. Lett. 82,
4727 (1999).
[23] S. Miyazawa and R. Jernigan, Macromolecules 18, 534 (1985).
[24] V. Abkevich, A. Gutin, and E. I. Shakhnovich, Biochemistry 33, 10026 (1994); G. Tiana, R.
A. Broglia, H. E. Roman, E. Vigezzi, and E. I. Shakhnovich, J. Chem. Phys. 108, 757 (1998).
[25] V. Abkevich, A. Gutin, and E. I. Shakhnovich, J. Mol. Biol. 252, 460-471 (1995).
[26] C. Clementi, A. Maritan, and J. Banavar, Phys. Rev. Lett. 81, 3287 (1998).
[27] J. Hernandez-Rojas and J. M. Llorente, Phys. Rev. Lett. 100, 258104 (2008).
[28] A. Kidera, Y. Konishi, M. Oka, T. Ooi, and H. A. Scheraga, J. Prot. Chem. 4, 23 (1985); A.
Kidera, Y. Konishi, T. Ooi, and H. A. Scheraga, J. Prot. Chem. 4, 265 (1985).
[29] S. Rackovsky, Phys. Rev. Lett. 106, 248101 (2011); Proc. Natl. Acad. Sci. U.S.A. 107, 8623
(2010).

FIG. 1. (Color online). Thick solid line: the normalized rate of convergence versus temperature
for the designed sequence $S_0$ for the time period $t_0 = 500$. Dash-dot and thin solid lines: the
same for the sequences $S_1$ and $S_3$ respectively. Note that the folding temperature of $S_3$ is
approximately $1.2\,T_f \approx 30$, as can be seen from Figs. 9 (a,b) in [25]. Dashed line: the
normalized rate of convergence $S_{\mathrm{bad}}$ of a typical bad folder (in this case a
homopolymer). The vertical dotted line corresponds to the folding temperature of $S_0$. The
temperature is given in dimensionless Miyazawa-Jernigan units multiplied by 100.




FIG. 2. (Color online). Normalized rate of convergence $R_N(t_0, T)$ vs time step $t_0$ for fixed
temperature $T = 0.001987\,k_B^{-1}$ of the 6 analyzed sequences in the off-lattice model (see the
text). For each point, $R_N(t_0, T)$ was calculated by averaging over 100 conformation pairs.
Inset: first stages of the time development of $R_N(t_0, T)$. The different behavior of good and
bad folders is already evident.

FIG. 3. (Color online). Upper panel: temperature dependence of the normalized rate of
convergence $R_N(t_0, T)$ for the 6 sequences considered within the off-lattice model,
$t_0 = 10^7$ time steps. Lower panel: specific heat curves of the sequences SEQ1, SEQ2 and SEQ3,
characterized as good folders by our method.

FIG. 4. (Color online). Time evolution of the root mean square deviation $\delta$ of sequences
SEQ1, SEQ2 and SEQ3 (with respect to their global minimum structures). The value of $\delta$ was
computed according to Eq. (5). The time axis is plotted on a logarithmic scale. $\delta$ decays
exponentially in time for the three sequences. The threshold value of 3.9 Å is denoted by the
dotted line. SEQ2 shows the fastest folding under Monte Carlo dynamics, followed by SEQ1 and SEQ3
(folding times are given in the text).

FIG. 5. (Color online). Native state conformations for some of the sequences designed using the
rate-of-convergence method developed in this work. Lower panel: S2 (left) and S3 (right) obtained
in the framework of the lattice model. Dotted lines connect those monomers that are in contact.
Upper panel: SEQ2 and SEQ3, designed within the off-lattice model.